About the project

I am looking forward to getting more first-hand experience with R and with data visualization and analysis.

My GitHub:
https://github.com/AleksanKo/IODS-project


Regression and model validation

I have completed the data wrangling exercises: I read the data from a file with the read.csv() function, created a new dataset by selecting columns from the old one, and wrote it to a new file with write.csv().


Logistic regression

I have completed the data wrangling exercises: I read the data from a file with the read.csv() function, created a new dataset by selecting columns from the old one, and wrote it to a new file with write.csv().
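The steps described above can be sketched roughly as follows (the file names and the column selection here are illustrative assumptions, not the exact ones used in the exercise):

```r
library(dplyr)

# read the raw data (the original UCI student files are semicolon-separated)
math <- read.csv("student-mat.csv", sep = ";")

# create a new dataset by selecting columns from the old one (illustrative choice)
selected <- select(math, school, sex, age, G1, G2, G3)

# write the new dataset to a file
write.csv(selected, "students_selected.csv", row.names = FALSE)
```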

Data description

matpor <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=',')
colnames(matpor)
##  [1] "school"     "sex"        "age"        "address"    "famsize"    "Pstatus"   
##  [7] "Medu"       "Fedu"       "Mjob"       "Fjob"       "reason"     "nursery"   
## [13] "internet"   "guardian"   "traveltime" "studytime"  "failures"   "schoolsup" 
## [19] "famsup"     "paid"       "activities" "higher"     "romantic"   "famrel"    
## [25] "freetime"   "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset consists of 35 variables. Many of the factors have only two levels and are thus binary. Such factors include:

  • school (student’s school: ‘GP’ = Gabriel Pereira, ‘MS’ = Mousinho da Silveira)
  • sex (‘F’ = female, ‘M’ = male)
  • address (where the student lives: ‘U’ = urban area, ‘R’ = rural area)
  • famsize (family size: ‘LE3’ = 3 or fewer family members, ‘GT3’ = more than 3 family members)
  • Pstatus (whether the student’s parents live together: ‘T’ = living together, ‘A’ = living apart)
  • nursery (whether the student went to nursery school: yes/no)
  • internet (whether the student has Internet access at home: yes/no)
  • schoolsup (whether the student gets extra educational support: yes/no)
  • famsup (whether the student gets family educational support: yes/no)
  • paid (whether the student has extra paid classes within the course subject (Math or Portuguese): yes/no)
  • activities (whether the student has extra-curricular activities: yes/no)
  • higher (whether the student wants to pursue higher education: yes/no)
  • romantic (whether the student is in a romantic relationship: yes/no)

Other factor variables include:

  • Mjob (mother’s job: ‘teacher’, ‘health’, ‘services’ (e.g. administrative or police), ‘at_home’, ‘other’)
  • Fjob (father’s job: same levels as Mjob)
  • reason (the student’s reason to choose this school: ‘home’,‘reputation’, ‘course’, ‘other’)
  • guardian (the student’s guardian: ‘mother’, ‘father’, ‘other’)

Integer variables include the following (when a variable ranges from 1 to 5, the scale runs from “very bad”/“very low” to “excellent”/“very high”):

  • age
  • Medu (mother’s education: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, 4 - higher education)
  • Fedu (father’s education: same levels as Medu)
  • traveltime (home to school travel time: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, 4 - >1 hour)
  • studytime (weekly study time: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, 4 - >10 hours)
  • failures
  • famrel (quality of family relationships: from 1 to 5)
  • freetime (free time after school: from 1 to 5)
  • goout (going out with friends: from 1 to 5)
  • Dalc (workday alcohol consumption: from 1 to 5)
  • Walc (weekend alcohol consumption: from 1 to 5)
  • health (current health status: from 1 to 5)
  • absences
  • G1 (first period grade)
  • G2 (second period grade)
  • G3 (final grade)
  • alc_use (average use of alcohol during the week)

There is also a logical variable high_use which shows if the average alcohol consumption is high (more than 2) or not.
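Given the descriptions above, the two derived variables can be sketched as follows (assuming alc_use is the mean of Dalc and Walc):

```r
# average of workday and weekend alcohol consumption
matpor$alc_use <- (matpor$Dalc + matpor$Walc) / 2

# TRUE when the average consumption is more than 2
matpor$high_use <- matpor$alc_use > 2
```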

Hypotheses

I have chosen 4 variables: activities, famrel, higher and internet. My hypotheses for them are as follows:

  1. Having extra-curricular activities correlates with low alcohol consumption.
  2. Having good (from 3 to 5) family relationships correlates with low alcohol consumption.
  3. Wish to get a higher education correlates with low alcohol consumption.
  4. Having Internet connection at home correlates with low alcohol consumption.

Plots

Plot for Hypothesis 1: At first glance there seem to be more students with high alcohol consumption among those who don’t have any activities. However, the data shows (both visually and numerically) that the numbers of students with high alcohol consumption in the two groups are almost the same:

library(dplyr)
n1 <- nrow(filter(matpor, activities == "yes", high_use == TRUE))
n2 <- nrow(filter(matpor, activities == "no", high_use == TRUE))

The difference in low alcohol consumption between the active and non-active students is also not significant:

n3 <- nrow(filter(matpor, activities == "yes", high_use == FALSE))
n4 <- nrow(filter(matpor, activities == "no", high_use == FALSE))
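The same four counts can be seen at a glance in a cross-tabulation (a sketch of an alternative to the filter calls above):

```r
# counts of high/low consumption by activity status
table(activities = matpor$activities, high_use = matpor$high_use)
```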

Plot for Hypothesis 2:

Most of the students ranked their family relationships as good or excellent (>250 students), which makes further analysis difficult. However, it is worth noting that the largest number of students with high alcohol consumption comes from families with good internal relationships.

Plot for Hypothesis 3:

There is also no correlation between striving for higher education and alcohol consumption, since almost every student wants to pursue higher education.

Plot for Hypothesis 4:

The results are much the same as for Hypothesis 1: there seems to be no correlation between having Internet access and alcohol consumption. Only 15 students without Internet at home drink a lot, while 97 students with Internet do. In percentages, however, the difference is small: 26% of the students without Internet are high consumers, compared with 30% of those with Internet. The same kind of filter as in Hypothesis 1 was used to obtain the counts:

i <- nrow(filter(matpor, internet == "yes", high_use == TRUE))
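The percentages quoted above can be computed directly with a proportion table instead of counting rows (a sketch):

```r
# row-wise proportions: share of high_use within each internet group
prop.table(table(matpor$internet, matpor$high_use), margin = 1)
```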

Logistic regression model

m <- glm(high_use ~ activities + famrel + internet + higher, data = matpor, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ activities + famrel + internet + higher, 
##     family = "binomial", data = matpor)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3167  -0.8643  -0.7642   1.3863   1.8870  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     0.9163     0.7070   1.296   0.1950  
## activitiesyes  -0.2410     0.2302  -1.047   0.2952  
## famrel         -0.2893     0.1213  -2.385   0.0171 *
## internetyes     0.2733     0.3297   0.829   0.4071  
## higheryes      -0.8244     0.4939  -1.669   0.0951 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 451.26  on 377  degrees of freedom
## AIC: 461.26
## 
## Number of Fisher Scoring iterations: 4
coef(m)
##   (Intercept) activitiesyes        famrel   internetyes     higheryes 
##     0.9162589    -0.2409905    -0.2893345     0.2732788    -0.8244081
OR <- exp(coef(m))
OR
##   (Intercept) activitiesyes        famrel   internetyes     higheryes 
##     2.4999204     0.7858491     0.7487617     1.3142666     0.4384945
  1. Each one-unit increase in famrel multiplies the odds of high_use by 0.75, i.e. the odds are reduced by about 25%.
  2. If the student has extra-curricular activities, the odds of high alcohol consumption are about 21% lower (OR = 0.79).
  3. If the student plans to pursue higher education, the odds of high alcohol consumption are about 56% lower (OR = 0.44).
  4. If the student has Internet at home, the odds of high alcohol consumption are about 31% higher (OR = 1.31).
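The odds ratios are more informative together with confidence intervals; a standard way to obtain them (not part of the original output):

```r
# 95% profile-likelihood confidence intervals on the odds-ratio scale
cbind(OR = exp(coef(m)), exp(confint(m)))
```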

Probabilities

##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE    109    5

The prediction is wrong in 119 cases (10 false positives and 109 false negatives). Only 5 cases are true positives (TRUE predicted as TRUE), while 258 cases are true negatives (FALSE predicted as FALSE).
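The confusion matrix above can be produced by turning the model’s predicted probabilities into class predictions (a sketch; 0.5 is the usual cutoff):

```r
# predicted probability of high_use for each student
matpor$probability <- predict(m, type = "response")

# classify as high_use when the probability exceeds 0.5
matpor$prediction <- matpor$probability > 0.5

# cross-tabulate actual vs. predicted values
table(high_use = matpor$high_use, prediction = matpor$prediction)
```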

# loss function: proportion of wrong predictions
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(matpor$high_use, matpor$probability)
## [1] 0.3115183
library(boot)
cv <- cv.glm(data = matpor, cost = loss_func, glmfit = m, K = 10)

cv$delta[1]
## [1] 0.3089005

The proportion of wrong predictions in 10-fold cross-validation (about 0.31) is practically the same as on the training data.


Clustering and classification

Data wrangling exercises are done in the corresponding R script.

Analysis exercises

Firstly, the Boston dataset is loaded from the MASS package:

library(MASS)
library(dplyr)
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The data has 14 variables, all of them numeric or integer. The variables include, for example, per-capita crime rate by town (crim), nitrogen oxides concentration (nox), index of accessibility to radial highways (rad), pupil-teacher ratio by town (ptratio) and average number of rooms per dwelling (rm).

summary(Boston)
##       crim                zn             indus            chas              nox        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000   Min.   :0.3850  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000   Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917   Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad              tax       
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000   Min.   :187.0  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000   Median :330.0  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549   Mean   :408.2  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000   Max.   :711.0  
##     ptratio          black            lstat            medv      
##  Min.   :12.60   Min.   :  0.32   Min.   : 1.73   Min.   : 5.00  
##  1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02  
##  Median :19.05   Median :391.44   Median :11.36   Median :21.20  
##  Mean   :18.46   Mean   :356.67   Mean   :12.65   Mean   :22.53  
##  3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :22.00   Max.   :396.90   Max.   :37.97   Max.   :50.00

rad (index of accessibility to radial highways) varies from 1 to 24. lstat (percentage of lower-status population) has several outliers: the minimum is 1.73%, the maximum 37.97%, and the mean 12.65%. There is also a noticeable outlier in the crim variable: max crim = 88.98. The black variable (1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town) has an outlier as well: min black = 0.32.

pairs(Boston)

There is a clear hyperbolic relationship between the nox and dis (weighted mean of distances to five Boston employment centres) variables: the higher the concentration of nitrogen oxides, the smaller the weighted mean of distances to the employment centres. Thus, we can say that there is a (negative) correlation between nitrogen oxide concentration and distance to employment centres.

There is also a hyperbolic relationship between lstat and medv (median value of owner-occupied homes in $1000s) variables: the lower the status of the population, the less the median cost of the homes in the area, which is pretty logical.

There is an almost linear correlation between the rm (average number of rooms per dwelling) and medv variables: the more rooms in a dwelling, the higher the median value of the homes, and vice versa.
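These visual impressions can be checked numerically from the correlation matrix of the variables discussed (a quick sketch):

```r
# pairwise correlations for the variable pairs discussed above
round(cor(Boston[, c("nox", "dis", "rm", "lstat", "medv")]), 2)
```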

Next the dataset is standardized, a categorical crime-rate variable is added, and the old variable crim is dropped:

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
#creating a new variable
breakpoints <- quantile(boston_scaled$crim)
labels <- c('low','med_low','med_high','high')
crime <- cut(boston_scaled$crim, breaks = breakpoints, include.lowest = TRUE, labels = labels)

#dropping and adding variables
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Dividing the data into train and test sets:

n <- nrow(boston_scaled)
random_rows <- sample(n,  size = n * 0.8)
train_data <- boston_scaled[random_rows,]
test_data <- boston_scaled[-random_rows,]

Linear discriminant analysis

Fitting and plotting the LDA:

lda.fit <- lda(crime ~ ., data = train_data)
classes <- as.numeric(train_data$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)

Predicting the classes with the LDA model:

correct_classes <- test_data[,"crime"]
test_data <- dplyr::select(test_data, -crime)

lda.pred <- predict(lda.fit, newdata = test_data)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       22       5        1    0
##   med_low    6      13        8    0
##   med_high   0       5       17    0
##   high       0       0        0   25

Overall, the high crime-rate class was predicted perfectly (all 25 cases correct). The low class was predicted correctly in 22 cases, with 5 cases misclassified as med_low and 1 as med_high. The med_low class was the hardest to predict: 13 correct cases, with 6 predicted as low and 8 as med_high. The med_high class was predicted correctly 17 times, with 5 cases misclassified as med_low.
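From the same objects the overall prediction accuracy can be computed directly (a sketch; for the table above it is (22 + 13 + 17 + 25) / 102 ≈ 0.75):

```r
# proportion of test observations classified correctly
mean(correct_classes == lda.pred$class)
```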

Calculating the distances and visualizing the clusters:

library(MASS)
data('Boston')
boston_scaled_again <- scale(Boston)

# Euclidean distance matrix of the standardized data
dist_eu <- dist(boston_scaled_again)

# k-means clustering with two clusters
km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

The optimal number of clusters is 2: 3 clusters already look reasonable, but one of them (the black one) does not seem very informative, and more than 3 clusters is redundant. In the case of rad, tax and ptratio the red cluster clearly picks out the outliers. The clusters overlap in the lstat-medv plot: one corresponds to areas with more lower-status residents and lower median home values, the other to areas with more higher-status residents and higher median home values (basically a poverty-vs-wealth dichotomy).
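The choice of two clusters can also be backed up with the usual elbow heuristic on the total within-cluster sum of squares (a sketch on the standardized data; the seed is an arbitrary choice for reproducibility):

```r
set.seed(123)  # arbitrary seed so the k-means runs are reproducible

# total WCSS for 1 to 10 clusters; the "elbow" where the curve levels off
# suggests a reasonable number of clusters
twcss <- sapply(1:10, function(k) kmeans(boston_scaled_again, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```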

Bonus Task:

model_predictors <- dplyr::select(train_data, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# install.packages("plotly")  # run once if plotly is not installed yet
library("plotly")

#colors from the crime classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train_data$crime)